Spatial join is a method to combine information from different shapefiles by using spatial relationships as the join key. For example, you would like to study which county in California state has the most socioeconomically disadvantaged students, but you have two shapefiles on hand, one is a California counties shapefile without the attribute of the number of socioeconomically disadvantaged students and the other is a shapefile of California public school locations supplemented with the schools' demographic information in 2019. Let’s conduct this analysis using Python and the geopandas package. You can use the read_file() method from the geopandas package to read shapefiles, convert the coordinate reference systems of the shapefiles to an appropriate one for the geography of your data by using the to_crs() method, and then use the sjoin() method to join the shapefiles based on their spatial relationship to one another.
If you already have Anaconda downloaded and installed, you can skip Part A and directly start the analysis in Part B. Make sure you also have packages pandas, geopandas of version 0.7 or higher, matplotlib, and descartes installed in the environment where you would like to conduct this analysis. Note starting with the version 0.7, geopandas made a big change in Coordinate Reference System representation and hence some code syntax differs before and after version 0.7, as described here and here. As our tutorial writes in new version syntax, please make sure your geopandas is of version 0.7 or later in order to run the code with no error. You may check the version of geopandas and upgrade it within your environment either in Anaconda Navigator or in your terminal window.
1) First, download Anaconda. Anaconda is a free and open-source distribution of Python. You can use Anaconda to install IDEs (integrated development environments where you can write and run code) and packages like Pandas and Geopandas. Go to the link to download Anaconda, https://www.anaconda.com/products/individual, and then open the .exe file that was downloaded and follow the instructions in the installation wizard prompt.
2) Once installation is complete, open Anaconda Navigator and create a new environment for your project. A Conda environment is a directory that contains a specific collection of Conda packages that you have installed. Conda has a default environment called 'base' that includes a Python installation and some core system libraries and dependencies of Conda. It is a “best practice” to avoid installing additional packages into your base environment, and, instead, create an isolated environment to manage packages and dependencies in a new project.
Click on the Environments selection in the left sidebar menu and then click on the 'Create' at the bottom. This will open a dialog box prompting you to create a name for the new environment. You can give any name for your new environment. Here, we use 'GIS_in_Python' as the environment name. Then click the 'Create' button within the dialog box to finish the creation.
3) Once you have your project environment set up, click on the arrow to the right of your new environment, 'GIS_in_Python' in this example, and select Open Terminal. This will give you access to the command line interface on your computer in a window.
4) Install the packages/libraries necessary for the analysis by entering the following commands in the opened terminal, one line at a time:
conda install pandas
conda install geopandas
conda install matplotlib
conda install descartes
5) Once you have those libraries all installed, select the new environment, 'GIS_in_Python' in this example, in the 'Applications on' dropdown menu, and then click "install" and "launch" under Jupyter Notebook. Jupyter Notebook will open in your web browser (it does not require the internet to work).
6) In Jupyter Notebook, navigate to the folder where you saved the code file you plan to use and open the .ipynb file (the extension for Jupyter Notebook files written in Python) to run it in the Notebook. If you would like to create a new .ipynb file, browse to the folder in which you would like to save your Notebook, then click the "New" dropdown button on the top-right and select "Python 3". Your new Notebook will open in a new tab in your browser. If you want to create a new directory using the Jupyter Notebook dashboard, click the "New" dropdown button and then select "Folder". To add files from your local machine, click the "Upload" button on the top-right to open a file chooser window and then choose the file you wish to upload.
1) Import necessary packages/libraries.
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import descartes
2) Use the gpd.read_file() function from the geopandas package to read the shapefile. Optionally, you can use the head() method to return the first 5 rows of the GeoDataFrame, and use the .shape attribute to check the number of rows and columns of the GeoDataFrame in the returned tuple (number of rows, number of columns). For this example, the number of rows of the 'CA_schools' and the 'CA_counties' suggest that there are 10051 schools and 58 counties in California state in 2019.
CA_schools = gpd.read_file("CA_schools.shp")
CA_schools.head()
| OBJECTI | YerRprt | FedID | CDSCode | CDCode | SCode | ContyNm | DstrctN | ScholNm | SchlTyp | ... | FOSpct | HOMcont | HOMpct | MIGcont | MIGpct | SEDcont | SEDpct | SWDcont | SWDpct | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019 | 6.910511e+10 | 1.100170e+12 | 110017 | 112607 | Alameda | Alameda County Office of Education | Envision Academy for Arts & Technology | High | ... | 0.3 | 0 | 0.0 | 0 | 0.0 | 314 | 81.599998 | 44 | 11.400000 | POINT (-122.26838 37.80468) |
| 1 | 2 | 2019 | 6.910511e+10 | 1.100170e+12 | 110017 | 123968 | Alameda | Alameda County Office of Education | Community School for Creative Education | Elementary | ... | 0.4 | 0 | 0.0 | 0 | 0.0 | 143 | 59.299999 | 34 | 14.100000 | POINT (-122.23863 37.78464) |
| 2 | 3 | 2019 | 6.910511e+10 | 1.100170e+12 | 110017 | 124172 | Alameda | Alameda County Office of Education | Yu Ming Charter | Elementary | ... | 0.0 | 0 | 0.0 | 0 | 0.0 | 65 | 14.600000 | 28 | 6.300000 | POINT (-122.28362 37.84737) |
| 3 | 4 | 2019 | 6.910511e+10 | 1.100170e+12 | 110017 | 125567 | Alameda | Alameda County Office of Education | Urban Montessori Charter | Elementary | ... | 0.2 | 3 | 0.7 | 0 | 0.0 | 133 | 30.799999 | 35 | 8.100000 | POINT (-122.18987 37.77846) |
| 4 | 5 | 2019 | 6.910511e+10 | 1.100170e+12 | 110017 | 130401 | Alameda | Alameda County Office of Education | Alameda County Juvenile Hall/Court | Juvenile Court | ... | 11.3 | 6 | 11.3 | 0 | 0.0 | 46 | 100.000000 | 13 | 28.299999 | POINT (-122.11822 37.71595) |
5 rows × 48 columns
CA_counties = gpd.read_file("CA_Counties_TIGER2016.shp")
CA_counties.head()
| STATEFP | COUNTYFP | COUNTYNS | GEOID | NAME | NAMELSAD | LSAD | CLASSFP | MTFCC | CSAFP | CBSAFP | METDIVFP | FUNCSTAT | ALAND | AWATER | INTPTLAT | INTPTLON | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 06 | 091 | 00277310 | 06091 | Sierra | Sierra County | 06 | H1 | G4020 | None | None | None | A | 2468694587 | 23299110 | +39.5769252 | -120.5219926 | POLYGON ((-13431319.751 4821511.426, -13431312... |
| 1 | 06 | 067 | 00277298 | 06067 | Sacramento | Sacramento County | 06 | H1 | G4020 | 472 | 40900 | None | A | 2499183617 | 76073827 | +38.4500114 | -121.3404409 | POLYGON ((-13490651.476 4680831.603, -13490511... |
| 2 | 06 | 083 | 00277306 | 06083 | Santa Barbara | Santa Barbara County | 06 | H1 | G4020 | None | 42200 | None | A | 7084000598 | 2729814515 | +34.5370572 | -120.0399729 | MULTIPOLYGON (((-13423116.772 4042044.149, -13... |
| 3 | 06 | 009 | 01675885 | 06009 | Calaveras | Calaveras County | 06 | H1 | G4020 | None | None | None | A | 2641820834 | 43806026 | +38.1838996 | -120.5614415 | POLYGON ((-13428575.483 4627725.228, -13428534... |
| 4 | 06 | 111 | 00277320 | 06111 | Ventura | Ventura County | 06 | H1 | G4020 | 348 | 37100 | None | A | 4773390489 | 945942791 | +34.3587415 | -119.1331432 | MULTIPOLYGON (((-13317853.594 3931602.414, -13... |
Let’s look at the shape of the shapefiles for 'CA_schools' and 'CA_counties'.
print(CA_schools.shape)
print(CA_counties.shape)
(10051, 48) (58, 18)
You may also use matplotlib for plotting to generate an overview of your GeoDataFrame.
plt.subplots(), "fig" short for figure, and "ax" short for axes.figsize = (12,12), the first number corresponding to width, the X axis, and the second corresponding to height, the Y axis. ax = ax sets axes on which to draw the plot.
fig, ax = plt.subplots(figsize=(12, 12))
CA_schools.plot(ax=ax)
<AxesSubplot:>
fig, ax = plt.subplots(figsize=(12, 12))
CA_counties.plot(ax=ax)
<AxesSubplot:>
3) Before spatial join, use the .crs attribute to check the current Coordinate Reference System (CRS)/projection of your spatial datasets, and if they are not projected into the same coordinate reference system, use the to_crs() method to re-project the data to a projection appropriate for the geographical area of your data. This is because the geopandas package can only carry out a spatial join between layers that are in the same projected coordinate.
In this example, 'CA_schools' and 'CA_counties' are the two GeoDataFrame that we need to examine their projections (.crs). Since the CRS of the two GeoDataFrames are different, EPSG:4326 for the 'CA_schools' and EPSG:3857 for the 'CA_counties', we must convert the CRS. A bit of research suggests that EPSG:3310 would be an appropriate projection for California state. Therefore, we convert the CRS of 'CA_schools' and 'CA_counties' to EPSG:3310 (.to_crs('epsg:3310')).
For more information and resources on coordinate systems and map projections, please see Appendix 1 in NYU Data Services’ QGIS tutorial, which is available here.
CA_schools.crs
<Geographic 2D CRS: EPSG:4326> Name: WGS 84 Axis Info [ellipsoidal]: - Lat[north]: Geodetic latitude (degree) - Lon[east]: Geodetic longitude (degree) Area of Use: - name: World - bounds: (-180.0, -90.0, 180.0, 90.0) Datum: World Geodetic System 1984 - Ellipsoid: WGS 84 - Prime Meridian: Greenwich
CA_counties.crs
<Projected CRS: EPSG:3857> Name: WGS 84 / Pseudo-Mercator Axis Info [cartesian]: - X[east]: Easting (metre) - Y[north]: Northing (metre) Area of Use: - name: World - 85°S to 85°N - bounds: (-180.0, -85.06, 180.0, 85.06) Coordinate Operation: - name: Popular Visualisation Pseudo-Mercator - method: Popular Visualisation Pseudo Mercator Datum: World Geodetic System 1984 - Ellipsoid: WGS 84 - Prime Meridian: Greenwich
CA_schools = CA_schools.to_crs('epsg:3310')
CA_counties = CA_counties.to_crs('epsg:3310')
4) Once the two GeoDataFrames have been converted into the same coordinate reference system, you can use the sjoin() method to carry out a spatial join between the two layers.
how argument specifies the type of join that will occur and which geometry is retained in the resultant GeoDataFrame, which can take the values left, right, or inner. By specifying how='left', we keep all rows from the 'CA_counties' and duplicate them if necessary to represent multiple hits between the two dataframes, and we retain only the geometry column of the left GeoDataFrames, i.e. 'CA_counties'. op argument specifies how geopandas decides whether or not to join the attributes of one object to another, based on their geometric relationship. By specifying op = 'contains' we can filter the joined results down to just those that counties contain the school points. join = gpd.sjoin(CA_counties, CA_schools, how = 'inner', op = 'contains')
Optionally, to have a sanity-check of the spatial join result, you can use the .shape attribute to check the number of rows and columns, and use the head() method to return the first 5 rows of geometry column of the new GeoDataFrame. The output of the sjoin() operation is a GeoDataFrame with one row for each school that is located inside a county (one of the 10051 schools is discarded), the polygon geometry of the counties that contain the schools, and all of the attribute columns from both input GeoDataFrames (48 + 18 = 66).
join.shape
(10050, 66)
join.geometry.head()
0 POLYGON ((-56192.570 186608.420, -56187.298 18... 0 POLYGON ((-56192.570 186608.420, -56187.298 18... 0 POLYGON ((-56192.570 186608.420, -56187.298 18... 0 POLYGON ((-56192.570 186608.420, -56187.298 18... 0 POLYGON ((-56192.570 186608.420, -56187.298 18... Name: geometry, dtype: geometry
5) Now that the attribute tables of the two spatial objects have been joined together, we would like to determine how many socioeconomically disadvantaged students were in each county of California state in 2019. We can use the groupby() method to group the GeoDataFrame 'join' by state names ( ['NAME'] ), and then use the sum() method to sum the values in the column SEDcont ( ['SEDcont'] ). To have a look at the top few counties with the most socioeconomically disadvantaged students, we can use the sort_values() method with the argument ascending = False to sort the results to be descending, and then use the head() method to show the first five rows.
CA = join.groupby(['NAME']).sum()['SEDcont']
CA.sort_values(ascending = False).head()
NAME Los Angeles 1020525 San Bernardino 293210 Riverside 283627 San Diego 261812 Orange 240190 Name: SEDcont, dtype: int64
In conclusion, the calculation result shows that the county having the most socioeconomically disadvantaged students in 2019 is Los Angeles county.